This is the first post of a trilogy focused on code optimization.
I’ll draw from my experience developing scientific software in both academia and industry to share practical techniques and tools that can help you streamline your R workflows without sacrificing clarity. By the end of this trilogy, you’ll have a solid understanding of the foundational principles of code optimization and when to approach it—without overcomplicating things.
I hope the principles presented here will help you write code that’s not just faster but also sustainable and practical to maintain.
Let’s dig in!
Whether you’re playing with large data, designing spatial pipelines,
or developing scientific packages, at some point everyone writes a
regrettably sluggish piece of junk code.
The simplest way to make it run faster is obvious enough: throw more money at your cloud provider, or upgrade your rig, and get MORE!
More cores, more
RAM, more POWER! Because who doesn’t love bragging about that shit? I
surely do!
On the other hand, money is expensive (duh!), and computing
has a serious environmental footprint. With this in mind, it’s also
good to remember that we went to
the Moon and back on less than 4 KB of RAM, so there must be a way to
make our junk code run in a more sustainable manner.
This is where code optimization comes into play!
Optimizing code is not just about making it faster; it’s about making it efficient for developers, users, and machines alike. For us, pitiful carbon-based blobs, readable code is easier to wield, ergo efficient. For a machine, efficient code runs fast and has a small memory footprint. And there is an inherent tension there: optimizing for computational performance alone often comes at the cost of readability, while clean, readable code can sometimes slow things down. That’s why optimization requires making some strategic choices. Before diving headfirst into code optimization, it’s crucial to understand the dimensions of code efficiency and when optimizing is actually worth it.
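To make this tension concrete, here is a small hypothetical R example (the function names are mine): the loop version reads like step-by-step pseudocode, while the vectorized version is what the machine actually runs fast.

```r
# Readable, step-by-step: grows the result one element at a time,
# which forces R to reallocate the vector on every iteration
squares_loop <- function(x) {
  out <- c()
  for (value in x) {
    out <- c(out, value^2)
  }
  out
}

# Machine-efficient: a single vectorized operation, no explicit loop
squares_vec <- function(x) {
  x^2
}

identical(squares_loop(1:10), squares_vec(1:10))  # TRUE: same result, very different cost
```

For a ten-element vector the difference is invisible; for millions of elements, the reallocations in the loop version dominate the runtime.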
Code efficiency is an abstract concept involving a complex web of causes and effects. Laying the whole thing bare is beyond the scope of this post, but I believe that understanding some of the foundations may help articulate successful code optimization strategies.
Let’s take a look at the diagram below.
On the
left, there are several major code features that we can tweak
to improve (or worsen!) the efficiency of our code. Changes in these
features are bound to shape the efficiency landscape of our
code in often divergent ways.
The following code features fundamentally shape how efficient our programs can be:
Each of these features will be explored in detail in the next articles of this trilogy.
These foundational choices impact three key performance dimensions:
At a higher level, two emergent properties arise:
Code optimization is a multidimensional trade-off. Improving one aspect often affects others. For example, speeding up execution might increase memory usage, and parallelization can create I/O bottlenecks. There’s rarely a single “best” solution, only trade-offs based on context and constraints.
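To make one of these trade-offs tangible, here is a minimal memoization sketch in base R (the helper names are hypothetical): repeated calls get faster because results are cached, at the price of keeping every computed result in memory.

```r
# Trade memory for speed: cache the results of an expensive function
memoize <- function(f) {
  cache <- new.env(parent = emptyenv())
  function(x) {
    key <- as.character(x)
    if (exists(key, envir = cache, inherits = FALSE)) {
      return(get(key, envir = cache))   # cache hit: no recomputation
    }
    result <- f(x)
    assign(key, result, envir = cache)  # cache miss: store for next time
    result
  }
}

slow_square <- function(x) { Sys.sleep(0.5); x^2 }
fast_square <- memoize(slow_square)

fast_square(4)  # takes about half a second the first time
fast_square(4)  # near-instant: served from the cache
```

The cache never shrinks, so the faster the function gets, the more memory it holds on to — exactly the kind of trade-off worth weighing before committing to an optimization.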
If for some reason you find yourself in the conundrum expressed in the title of this section, then you might find solace in the First Commandment of Code Optimization.
“Thou shall not optimize thy code.”
Also known in some circles as the YOLO Principle, this commandment reveals the righteous path! If your code is reasonably simple and works as expected, you can call it a day and move on, because there is no reason whatsoever to attempt any optimization. This idea aligns well with a principle enunciated long ago:
“Premature optimization is the root of all evil.” — Donald Knuth, “Structured Programming with go to Statements”
Premature optimization happens when we let performance considerations get in the way of our code design. Designing code is a taxing task already, and designing code while trying to make it efficient at the same time is even harder! Having a non-trivial fraction of our mental bandwidth focused on optimization results in code that is more complex than it should be, and increases the chance of introducing bugs.
That said, there are legitimate reasons to break the First Commandment. Maybe you are bold enough to publish your code in a paper (Reviewer #2 says hi), release it as a package for the community, or simply share it with your data team. In these cases, the Second Commandment comes into play.
“Thou shall make thy code simple.”
Optimizing code for simplicity isn’t just about aesthetics; it’s about making it readable, maintainable, and easy to use and debug. In essence, this commandment ensures that we optimize the time required to interact with the code. Any code that saves the time of users and maintainers is efficient enough already!
This post is not focused on code simplicity, but here are a few key principles that might be helpful:
Use a consistent style: Stick to a recognizable style guide, such as the tidyverse style guide or Google’s R style guide.
Avoid deep nesting: Excessive nesting makes code harder to read and debug. This wonderful video makes the point quite clear: Why You Shouldn’t Nest Your Code.
Use meaningful names: Clear names for functions, arguments, and variables make the code self-explanatory! Avoid cryptic abbreviations or single-letter names and do not hesitate to use long and descriptive names. The video Naming Things in Code is a great resource on this topic.
Limit the number of function arguments: According to Uncle Bob Martin, author of the book “Clean Code”, the ideal number of arguments for a function is zero. There’s no need to be that extreme, but it is important to acknowledge that the user’s cognitive load increases with the number of arguments. If a function needs more than five arguments, you can either rethink your approach or let the users pay the check.
Beyond these tips, I highly recommend the book A Philosophy of Software Design, by John Ousterhout. It helped me find new ways to write better code!
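To illustrate the nesting tip above with a hypothetical R example, guard clauses (early returns) flatten the structure without changing the behavior:

```r
# Deeply nested: each check pushes the real work one level deeper
mean_positive_nested <- function(x) {
  if (is.numeric(x)) {
    if (length(x) > 0) {
      pos <- x[x > 0]
      if (length(pos) > 0) {
        mean(pos)
      } else {
        NA_real_
      }
    } else {
      NA_real_
    }
  } else {
    stop("x must be numeric")
  }
}

# Flat: guard clauses dispatch the edge cases first, then the happy path
mean_positive_flat <- function(x) {
  if (!is.numeric(x)) stop("x must be numeric")
  pos <- x[x > 0]
  if (length(pos) == 0) return(NA_real_)
  mean(pos)
}
```

Both functions return the same results, but in the flat version the main logic sits at the top indentation level, where it is easiest to read and debug.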
At this point we have clean and elegant code that runs once and gets the job done, great! But what if it must run thousands of times in production? Or worse, what if a single execution takes hours or even days? In these cases, optimization shifts from a nice-to-have to a requirement. Yep, there’s a commandment for this too.
“Thou shall optimize wisely.”
At this point you might be at the ready, fingers on the keyboard,
about to deface your pretty code for the sake of sheer performance. Just
don’t. This is a great point to stop, go back to the whiteboard, and
think carefully about what you actually need to do. You
gotta be smart about your next steps!
Here are a couple of ideas that might help you get smart about optimization.
First, keep the Pareto Principle in mind! It says that, roughly, 80% of the consequences come from 20% of the causes. Applied to code optimization, this principle translates into a simple fact: most performance issues are produced by a small fraction of the code. From there, the best course of action is to identify these critical code blocks (more about this later) and focus our optimization efforts on them. Once you’ve identified the real bottlenecks, the next step is making sure your optimizations don’t introduce unnecessary complexity.
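In R, that identification step might look like the sketch below, using the base sampling profiler `Rprof()` (the workload functions are made up for illustration; the `profvis` package offers an interactive view of the same data):

```r
# Profile a toy workload to find the hot spots *before* optimizing anything
slow_part <- function(n) {
  out <- 0
  for (i in seq_len(n)) out <- out + sqrt(i)  # deliberately loop-heavy
  out
}
fast_part <- function(n) sum(sqrt(seq_len(n)))

workload <- function() {
  slow_part(5e6)   # likely the 20% of the code causing 80% of the runtime
  fast_part(5e6)
}

Rprof(tmp <- tempfile())   # start the sampling profiler
workload()
Rprof(NULL)                # stop profiling
summaryRprof(tmp)$by.self  # time attributed to each function
```

The `$by.self` table points straight at the functions where the program actually spends its time, so the optimization effort lands on the critical 20% instead of being spread everywhere.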
Second, beware of over-optimization. Taking code optimization too far can do more harm than good! Over-optimization happens when we keep pushing for marginal performance gains at the expense of clarity. It often results in convoluted one-liners and obscure tricks to save milliseconds that will confuse future you while making your code harder to maintain. Worse yet, excessive tweaking can introduce subtle bugs. In short, optimizing wisely means knowing when to stop. A clear, maintainable solution that runs fast enough is often better than a convoluted one that chases marginal gains.
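A hypothetical R example of the kind of “clever” rewrite this warning is about: both lines count the even values, but the second obscures the intent for a gain measured in microseconds, if any.

```r
x <- c(3, 8, 5, 12, 7, 4)

# Clear: says exactly what it does
n_even <- sum(x %% 2 == 0)

# Over-optimized: a bit-twiddling trick nobody should have to decode
n_even_tricky <- length(x) - sum(bitwAnd(as.integer(x), 1L))

identical(n_even, n_even_tricky)  # TRUE
```

Future you, reading the second line six months from now, pays the real cost.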
Beyond these important points, there is no golden rule to follow here. Optimize when necessary, but never at the cost of clarity!
In the next article of this trilogy, we’ll dive deeper into the code features that shape efficiency: programming languages, simplicity and readability, and algorithm design. We’ll explore how high-level design choices impact the performance and maintainability of your code.
Stay tuned!